Distributionally Robust Policy Gradient for Offline Contextual Bandits

Published in AISTATS, 2023

Recommended citation: Z. Yang, Y. Guo, P. Xu, A. Liu, and A. Anandkumar. Distributionally robust policy gradient for offline contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 6443–6462. PMLR, 2023.

In this paper, we employ a distributionally robust policy gradient method, DROPO, to account for the distributional shift between the static logging policy and the learning policy in policy-gradient optimization. Our approach conservatively estimates the conditional reward distribution and updates the policy accordingly.
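To illustrate the general idea, the sketch below trains a linear softmax policy on logged bandit data with a conservative, clipped importance-weighted policy gradient. It is a minimal NumPy toy, not the paper's DROPO algorithm: the dataset, the clipping threshold, and the variance-style penalty on the importance weights are all illustrative assumptions standing in for the paper's distributionally robust reward estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged dataset: contexts X, actions A drawn from a uniform logging
# policy mu, and a binary reward that depends on the first context feature.
# (All of this is synthetic, for illustration only.)
n, d, k = 500, 4, 3
X = rng.normal(size=(n, d))
mu = np.full((n, k), 1.0 / k)                 # logging-policy probabilities
A = rng.integers(0, k, size=n)                # logged actions
opt = (X[:, 0] > 0).astype(int)               # context-dependent best action
R = (A == opt).astype(float)                  # reward 1 iff logged action is best

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

W = np.zeros((d, k))                          # linear softmax policy parameters
lr, penalty, clip = 0.5, 0.1, 10.0            # illustrative hyperparameters

for _ in range(200):
    P = softmax(X @ W)                        # pi(a|x) for every action
    pi_a = P[np.arange(n), A]
    # Clipped importance weights pi(a|x)/mu(a|x) guard against extreme shift.
    w = np.clip(pi_a / mu[np.arange(n), A], 0.0, clip)
    # Conservative objective: importance-weighted reward minus a penalty
    # that grows with the weights (a crude pessimism surrogate).
    adv = w * R - penalty * w * (w - 1.0)
    # Policy gradient for linear softmax: grad log pi(a|x) = x (1{a} - pi).
    E = np.eye(k)[A]
    G = X.T @ (adv[:, None] * (E - P))
    W += lr * G / n

P = softmax(X @ W)
value = float(np.mean(P[np.arange(n), opt]))  # avg prob. on the best action
print(f"avg probability on the optimal action: {value:.2f}")
```

A uniformly random policy would place probability 1/k ≈ 0.33 on the optimal action, so any final value clearly above that indicates the conservative off-policy update is improving the policy from logged data alone.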

Download paper here